This report explores a dataset describing the quality of 4,898 white wines based on the chemical properties of each wine.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median : 5.200 Median :0.04300 Median : 34.00 Median :134.0
## Mean : 6.391 Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :65.800 Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The output above shows the dimensions, structure, and summary statistics of the 13 variables that describe each of the 4,898 white wines.
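The code that produced this output is not shown. A minimal sketch of how such an overview could be generated, assuming the data frame is called `white_wine` (the name that appears in the grouped output further down). The first five rows printed by `str()` serve as stand-in data; in practice the full file would be loaded first, and its name is not given here.

```r
# Stand-in data: the first five rows as printed in the str() output above.
# In the real analysis, white_wine would hold all 4,898 rows.
white_wine <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),  # rounded as printed
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

dim(white_wine)      # number of rows and columns
str(white_wine)      # type and first values of each variable
summary(white_wine)  # min, quartiles, mean, and max per variable
```

With the full dataset, `dim()` would report the 4,898 rows and 13 columns shown at the top of the output.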
The distribution of quality appears roughly normal, with the bulk of the wines rated between 5 and 7. There are no wines with a quality score below 3, and only a few wines with a quality score of 9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The median fixed acidity in the white wines is 6.8 g/dm^3. The middle half of the wines have a fixed acidity between 6.3 and 7.3 g/dm^3. There is also an outlier with a fixed acidity over 14.0 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The distribution of volatile acidity is right skewed, with a median value of 0.26. The middle half of the values fall between 0.21 and 0.32. There are a few outliers at 0.9, 1.0, and 1.1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The middle half of the white wines have a citric acid concentration between 0.27 and 0.39 g/dm^3. The distribution is right skewed; however, it also shows a peak around 0.48, and a few wines have a citric acid value over 1.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual sugar distribution has a median value of 5.2 g/dm^3. It is a right-skewed distribution with a long tail: several bars on the right stretch the data all the way to 65.8 g/dm^3.
##Chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The amount of chlorides in the white wines has a median value of 0.043 g/dm^3. It looks like a normal distribution around the peak but has a long tail on the right side as the maximum amount of chlorides in the dataset is 0.346 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The distribution of free sulfur dioxide concentrations is also right skewed. The median value is 34.0 mg/dm^3, while the mean is 35.31 mg/dm^3. These are fairly close; however, there is a large gap between 145 and the maximum value of 289.0.
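Comparing the mean with the median is a quick check for skew: in a right-skewed distribution the mean tends to be pulled above the median. A small sketch using the first ten free sulfur dioxide values printed by `str()` earlier. They are a stand-in for the full column, so the numbers differ from the summary above.

```r
# First ten free.sulfur.dioxide values as printed by str(); stand-in data only.
free_so2 <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)

so2_mean   <- mean(free_so2)
so2_median <- median(free_so2)

# A positive gap (mean above median) is consistent with a right skew.
skew_gap <- so2_mean - so2_median
print(c(mean = so2_mean, median = so2_median, gap = skew_gap))
```

The same comparison applies to any of the other variables summarized in this section.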
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The total sulfur dioxide distribution is close to symmetrical, as its median is 134 mg/dm^3 and its mean is 138.4 mg/dm^3. There are a few outliers with a total sulfur dioxide concentration higher than 275 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The density of the white wines does not vary much: the middle half of the values lie between 0.9917 and 0.9961 g/cm^3. The distribution is close to symmetrical, but one wine has a maximum density of 1.0390 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
All the wines have a low pH, which means they are acidic: on the pH scale of 0-14, values below 7 are acidic. The distribution is symmetrical, as the median (3.180) and the mean (3.188) are almost identical.
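To put these values in perspective, pH is the negative base-10 logarithm of the hydrogen ion concentration, so each unit below 7 corresponds to ten times more hydrogen ions than a neutral solution. A short worked example for the median pH of 3.18 (the variable names are chosen for this illustration):

```r
median_pH <- 3.18  # median pH from the summary above

# pH = -log10([H+]), so [H+] = 10^(-pH) in mol per litre.
h_conc <- 10^(-median_pH)

# How many times more hydrogen ions than a neutral solution at pH 7.
ratio_to_neutral <- 10^(7 - median_pH)

c(h_conc = h_conc, ratio_to_neutral = ratio_to_neutral)
```

Because the scale is logarithmic, the narrow pH range in this dataset (2.72 to 3.82) still spans more than a tenfold difference in hydrogen ion concentration.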
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The sulphates distribution is slightly right skewed. The median value of sulphates is 0.47. The middle half of the white wines have a concentration between 0.41 and 0.55.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The distribution of alcohol is right skewed. The minimum alcohol level among the white wines in this dataset is 8%. The median value is 10.4%, which is expected for white wines.
The dataset has 13 variables that describe 4,898 different white wines. One variable, ‘X’, simply numbers the wines from 1 to 4,898.
The main feature of the dataset that interests me is the quality rating of the wines.
I believe all the chemical tests may add support to the investigation. Each factor contributes to the overall flavor and quality of the wine. Some variables may be strongly correlated, such as total sulfur dioxide and free sulfur dioxide.
No new variables were created from the existing variables.
There were no unusual distributions in the dataset. The dataset was already tidy, which makes it well suited to this analysis.
## [1] "Median of fixed.acidity by quality:"
## white_wine$quality: 3
## [1] 7.3
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 6.9
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 6.8
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 6.8
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 6.7
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 6.8
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 7.1
There is a slight downward trend in fixed acidity as quality increases. For quality ratings from 4 to 8, the median values stay between 6.7 and 6.9. For the extreme quality ratings of 3 and 9, the median acidity is above 7.0, which suggests that fixed acidity does not have a large impact on quality.
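The grouped medians above have the layout that base R's `by()` prints. A minimal sketch of that kind of call, assuming a data frame named `white_wine` as in the output. The six rows here are invented purely for illustration and are not taken from the dataset.

```r
# Hypothetical toy rows; values are made up for illustration.
white_wine <- data.frame(
  fixed.acidity = c(7.4, 7.2, 6.9, 6.7, 6.8, 6.6),
  quality       = c(4L, 4L, 5L, 5L, 6L, 6L)
)

print("Median of fixed.acidity by quality:")
by(white_wine$fixed.acidity, white_wine$quality, median)

# tapply() computes the same medians and returns them in a form
# that is easier to reuse, with one entry per quality level.
acidity_medians <- tapply(white_wine$fixed.acidity, white_wine$quality, median)
acidity_medians
```

Swapping `fixed.acidity` for another column name would give the corresponding table for any of the variables that follow.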
## [1] "Median of volatile.acidity by quality:"
## white_wine$quality: 3
## [1] 0.26
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 0.32
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 0.28
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 0.25
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 0.25
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 0.26
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 0.27
The medians suggest a trend in which lower volatile acidity goes with higher wine quality. This is visible in the 6-8 quality range, where the medians (0.25 to 0.26) are lower than those for quality ratings of 4 and 5 (0.32 and 0.28).
## [1] "Median of citric.acid by quality:"
## white_wine$quality: 3
## [1] 0.345
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 0.29
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 0.32
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 0.32
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 0.31
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 0.32
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 0.36
Higher citric acid seems to go with higher wine quality. Wines with a quality of 4 have a median citric acid level of 0.29 g/dm^3, while wines with a quality of 5-8 have a median level between 0.31 and 0.32 g/dm^3. Wines with a quality of 9 have a median citric acid value of 0.36 g/dm^3.
## [1] "Median of residual.sugar by quality:"
## white_wine$quality: 3
## [1] 4.6
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 2.5
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 7
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 5.3
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 3.65
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 4.3
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 2.2
Looking more closely at the medians, residual sugar appears to have little impact on wine quality: there are peaks and troughs rather than a clear trend. For example, wines with a quality of 6 have a median residual sugar level of 5.3 g/dm^3, the median drops to 3.65 g/dm^3 at a quality of 7, and then it picks back up at a quality of 8.
## [1] "Median of chlorides by quality:"
## white_wine$quality: 3
## [1] 0.041
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 0.046
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 0.047
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 0.043
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 0.037
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 0.036
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 0.031
There is a slight relationship between chlorides and quality: the lower the chloride concentration, the higher the quality.
## [1] "Median of free.sulfur.dioxide by quality:"
## white_wine$quality: 3
## [1] 33.5
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 18
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 35
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 34
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 33
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 35
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 28
Wines with a quality level of 5-8 seem to have a higher free sulfur dioxide concentration than wines with a quality level of 4 or 9.
According to the dataset description, SO2 is mostly undetectable in wine at low concentrations, but at free SO2 concentrations over 50 ppm (~50 mg/dm^3), SO2 becomes evident in the nose and taste of wine.
## [1] "Median of total.sulfur.dioxide by quality:"
## white_wine$quality: 3
## [1] 159.5
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 117
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 151
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 132
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 122
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 122
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 119
Total sulfur dioxide has a similar relationship with quality as free sulfur dioxide. The middle quality levels of 5-8 have a higher concentration than levels 4 or 9. There is, however, a steady decrease in total sulfur dioxide concentration from quality level 5 upward.
## [1] "Median of density by quality:"
## white_wine$quality: 3
## [1] 0.994425
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 0.9941
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 0.9953
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 0.99366
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 0.99176
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 0.99164
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 0.9903
Lower density generally goes with higher quality. There is a slight uptick in median density from quality level 4 to 5, after which density decreases as the quality level increases.
## [1] "Median of pH by quality:"
## white_wine$quality: 3
## [1] 3.215
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 3.16
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 3.16
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 3.18
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 3.2
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 3.23
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 3.28
We see that as the pH level increases, the quality increases as well. We will check the correlation between pH and acidity to see whether it is strong.
## [1] "Median of sulphates by quality:"
## white_wine$quality: 3
## [1] 0.44
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 0.47
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 0.47
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 0.48
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 0.48
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 0.46
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 0.46
The sulphates concentration increases slightly as the quality increases; however, it drops slightly to 0.46 g/dm^3 at quality levels 8 and 9.
## [1] "Median of alcohol by quality:"
## white_wine$quality: 3
## [1] 10.45
## ------------------------------------------------------------
## white_wine$quality: 4
## [1] 10.1
## ------------------------------------------------------------
## white_wine$quality: 5
## [1] 9.5
## ------------------------------------------------------------
## white_wine$quality: 6
## [1] 10.5
## ------------------------------------------------------------
## white_wine$quality: 7
## [1] 11.4
## ------------------------------------------------------------
## white_wine$quality: 8
## [1] 12
## ------------------------------------------------------------
## white_wine$quality: 9
## [1] 12.5
Apart from a dip in median alcohol content at a quality rating of 5, the quality of the wine increases as the alcohol content increases.
As the pH increases, the fixed acidity drops: less acidic wines sit closer to the neutral pH level of 7.
Correlation coefficient:
##
## Pearson's product-moment correlation
##
## data: pH and log10(fixed.acidity)
## t = -33.783, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4572280 -0.4117972
## sample estimates:
## cor
## -0.4347892
There is a moderate negative correlation (about -0.43) between pH and the log10 of fixed acidity.
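The test output above lists its inputs as `pH and log10(fixed.acidity)`, which is how `cor.test()` labels a call made on those two expressions. A hedged sketch of such a call, using the first ten rows printed by `str()` as stand-in data, so the estimate will not match the full-data value:

```r
# First ten pH and fixed.acidity values as printed by str(); stand-in data only.
white_wine <- data.frame(
  pH            = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22),
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1)
)

# Pearson correlation between pH and log10-transformed fixed acidity.
res <- with(white_wine, cor.test(pH, log10(fixed.acidity), method = "pearson"))
res$estimate  # correlation coefficient
res$conf.int  # 95 percent confidence interval
```

The degrees of freedom reported by the test equal the number of rows minus two, which is why the full dataset shows `df = 4896`.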
We can see the citric acid does not have a relation with the pH levels.
The volatile acidity does not have a relation with the pH level either.
We expect the density of wine to be close to that of water (1 g/cm^3); however, it depends on the sugar and alcohol content of the wine.
Density increases as the residual sugar increases, while it decreases as the alcohol content increases.
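The direction of these two relationships can be checked numerically. A small sketch using the first five rows printed by `str()`, where density appears rounded to three decimals, as a stand-in for the full data:

```r
# First five rows as printed by str(); density is rounded as shown there.
white_wine <- data.frame(
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Sign of each correlation: positive for sugar, negative for alcohol.
r_sugar   <- cor(white_wine$density, white_wine$residual.sugar)
r_alcohol <- cor(white_wine$density, white_wine$alcohol)
c(sugar = r_sugar, alcohol = r_alcohol)
```

Five rows are far too few to estimate the strength of either relationship; the sketch only illustrates the signs described in the text.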
##
## Pearson's product-moment correlation
##
## data: residual.sugar and alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
I was surprised by this result: I expected a stronger correlation between alcohol content and residual sugar, because the alcohol in wine is formed by the fermentation of the sugars in grapes.
We do not know which grapes were used, and each type of grape may yield a different sugar content.
Sulphates are wine additives that contribute to sulfur dioxide gas levels.
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and sulphates
## t = 9.5019, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1069590 0.1619585
## sample estimates:
## cor
## 0.1345624
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and sulphates
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03126264 0.08707928
## sample estimates:
## cor
## 0.05921725
Based on the correlation coefficients (0.13 and 0.06), there is almost no relationship between sulphate levels and sulfur dioxide.
We see that wine quality is related to the amount of alcohol and, more weakly, residual sugar in it. The correlation coefficients of the other variables with quality are listed below.
## [,1]
## fixed.acidity -0.08448545
## volatile.acidity -0.19656168
## citric.acid 0.01833273
## residual.sugar -0.08206979
## chlorides -0.31448848
## free.sulfur.dioxide 0.02371338
## total.sulfur.dioxide -0.19668029
## density -0.34835102
## pH 0.10936208
## sulphates 0.03331897
## alcohol 0.44036918
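A single-column table with a `[,1]` header like the one above is what `cor()` returns when a set of columns is correlated against one vector. A minimal sketch under that assumption. The data are simulated here for illustration, so the resulting numbers are not the dataset's.

```r
set.seed(42)  # reproducible simulated data, not the wine dataset

n <- 200
alcohol <- runif(n, min = 8, max = 14)
density <- 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
pH      <- rnorm(n, mean = 3.2, sd = 0.15)
# Hypothetical quality score loosely driven by alcohol, kept within 3 to 9.
quality <- pmin(9, pmax(3, round(1 + 0.45 * alcohol + rnorm(n))))

white_wine <- data.frame(alcohol, density, pH, quality)

# Correlate every predictor column with quality in one call.
predictors  <- setdiff(names(white_wine), "quality")
quality_cor <- cor(white_wine[, predictors], white_wine$quality)
quality_cor  # one row per predictor, one unnamed column
```

Selecting the predictors with `setdiff()` keeps the target column out of its own correlation table.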
I was surprised that there was not a stronger relationship between residual sugar and alcohol levels, mainly because alcohol comes from the fermentation of sugars.
The variable that had the strongest relationship with quality was the alcohol content level.
To assist with this section we will go ahead and make a correlation matrix:
Of the variables, alcohol and density correlate most strongly with quality.
These plots show the steps I used to include the relationship of a third variable in the plots of quality compared to alcohol. Alcohol is on the x-axis, as it has the strongest correlation with quality. The plots show the positive correlation between alcohol and quality while also showing the weak correlations with pH and residual sugar.
The plots show that higher alcohol levels combined with lower residual sugar were associated with a higher overall quality level.
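One way to check this kind of pattern without a plot is to bin both variables and compare mean quality per cell. A sketch on simulated data: the relationship between the variables is built into the simulation, so it only illustrates the mechanics. The bin labels and the median split are choices made for this example.

```r
set.seed(1)  # simulated data for illustration only

n <- 300
alcohol        <- runif(n, 8, 14)
residual.sugar <- runif(n, 1, 20)
quality <- round(3 + 0.4 * (alcohol - 8) - 0.05 * residual.sugar + rnorm(n, sd = 0.5))

# Split each variable at its median into "low" and "high" bins.
alcohol_bin <- cut(alcohol, breaks = c(-Inf, median(alcohol), Inf),
                   labels = c("low alcohol", "high alcohol"))
sugar_bin   <- cut(residual.sugar, breaks = c(-Inf, median(residual.sugar), Inf),
                   labels = c("low sugar", "high sugar"))

# Mean quality in each alcohol-by-sugar cell.
cell_means <- tapply(quality, list(alcohol_bin, sugar_bin), mean)
round(cell_means, 2)
```

On real data, finer bins or a third-variable color scale in a scatter plot would show the same comparison in more detail.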
I was surprised that a higher amount of citric acid together with a higher alcohol content does not necessarily mean a higher quality score.
We see a big impact of the alcohol level on the quality of wines. Across quality ratings 3-5 there is a slight dip in alcohol level, but after that dip the quality rating rises as the alcohol level increases.
We can see the distribution of the fixed acidity concentration across the pH levels. As the fixed acidity levels decrease, the pH levels increase. This makes sense, as the pH scale runs from 0 to 14: 0 is very acidic, like battery acid; 7 is neutral, similar to water; and 14 is very alkaline, close to drain cleaner.
Residual sugar concentration has an impact on density: the more residual sugar, the higher the density, which is also evident from the outliers on the graph.
The project was an enjoyable opportunity to apply what I learned about R to a real-world application. Moving from examining one variable at a time to multivariate plots showed that some variables affect each other, which in turn affects quality. One struggle was correcting the errors that popped up when running the code. Something that went well was pulling the data and analyzing various columns, such as computing the quantiles, mean, and median for the alcohol level.
Future work could incorporate the red wine dataset into this exploration to see whether the quality of red wines is affected in the same way as that of white wines, or whether different variables drive it.